PointASNL: Robust Point Clouds Processing using Nonlocal Neural Networks with Adaptive Sampling
Raw point cloud data inevitably contains outliers or noise introduced during acquisition by 3D sensors or by reconstruction algorithms. In this paper, we present a novel end-to-end network for robust point cloud processing, named PointASNL, which can deal with noisy point clouds effectively. The key component of our approach is the adaptive sampling (AS) module. It first re-weights the neighbors around the initial points sampled by farthest point sampling (FPS), and then adaptively adjusts the sampled points beyond the original point set. Our AS module not only benefits feature learning on point clouds, but also eases the biasing effect of outliers. To further capture both local neighborhood and long-range dependencies of each sampled point, we propose a local-nonlocal (L-NL) module inspired by the nonlocal operation. The L-NL module makes the learning process insensitive to noise. Extensive experiments verify the robustness and superiority of our approach on point cloud processing tasks across synthetic, indoor, and outdoor data, with or without noise. Specifically, PointASNL achieves state-of-the-art robust performance for classification and segmentation on all datasets, and significantly outperforms previous methods on the real-world outdoor SemanticKITTI dataset, which contains considerable noise. Our code is released at https://github.com/yanx27/PointASNL. (To appear in CVPR 2020. Results also listed on the ScanNet benchmark: http://kaldir.vc.in.tum.de/scannet_benchmark)
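
To make the adaptive-sampling idea concrete, here is a minimal PyTorch sketch (not the authors' implementation; the layer size, neighborhood size k, and the naive FPS routine are assumptions for illustration). It gathers the k nearest neighbors of each FPS-sampled point, scores them with a small learned layer, and replaces the sampled point's coordinates and features with the attention-weighted combination of its neighborhood:

```python
import torch
import torch.nn as nn


def farthest_point_sampling(xyz: torch.Tensor, m: int) -> torch.Tensor:
    """Naive FPS for illustration. xyz: (N, 3). Returns indices of m sampled points."""
    n = xyz.shape[0]
    idx = torch.zeros(m, dtype=torch.long)
    dist = torch.full((n,), float("inf"))
    idx[0] = int(torch.randint(0, n, (1,)))
    for i in range(1, m):
        dist = torch.minimum(dist, torch.norm(xyz - xyz[idx[i - 1]], dim=1))
        idx[i] = torch.argmax(dist)
    return idx


class AdaptiveSampling(nn.Module):
    """Illustrative re-weighting of FPS neighbors (hypothetical layer sizes)."""

    def __init__(self, feat_dim: int, k: int = 8):
        super().__init__()
        self.k = k
        self.score = nn.Linear(feat_dim, 1)  # per-neighbor attention logit

    def forward(self, xyz, feats, sample_idx):
        # xyz: (N, 3), feats: (N, C), sample_idx: (M,) indices from FPS
        d = torch.cdist(xyz[sample_idx], xyz)             # (M, N) pairwise distances
        knn_idx = d.topk(self.k, largest=False).indices   # (M, k) nearest neighbors
        nbr_xyz, nbr_feat = xyz[knn_idx], feats[knn_idx]  # (M, k, 3), (M, k, C)
        w = torch.softmax(self.score(nbr_feat), dim=1)    # (M, k, 1) attention weights
        new_xyz = (w * nbr_xyz).sum(dim=1)                # adjusted point coordinates
        new_feat = (w * nbr_feat).sum(dim=1)              # aggregated point features
        return new_xyz, new_feat
```

Because the adjusted coordinates are a learned convex-like combination of a point's neighborhood, an outlier selected by FPS can be pulled back toward the underlying surface, which is the intuition the abstract describes.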
LATR: 3D Lane Detection from Monocular Images with Transformer
3D lane detection from monocular images is a fundamental yet challenging task
in autonomous driving. Recent advances primarily rely on structural 3D
surrogates (e.g., bird's eye view) built from front-view image features and
camera parameters. However, the depth ambiguity in monocular images inevitably
causes misalignment between the constructed surrogate feature map and the
original image, posing a great challenge for accurate lane detection. To
address the above issue, we present a novel LATR model, an end-to-end 3D lane
detector that uses 3D-aware front-view features without transformed view
representation. Specifically, LATR detects 3D lanes via cross-attention based
on query and key-value pairs, constructed using our lane-aware query generator
and dynamic 3D ground positional embedding. On the one hand, each query is
generated based on 2D lane-aware features and adopts a hybrid embedding to
enhance lane information. On the other hand, 3D space information is injected
as positional embedding from an iteratively-updated 3D ground plane. LATR
outperforms previous state-of-the-art methods on synthetic Apollo as well as realistic OpenLane and ONCE-3DLanes by large margins (e.g., an 11.4 gain in F1 score on OpenLane). Code will be released at https://github.com/JMoonr/LATR. (Accepted by ICCV 2023 as an oral presentation.)
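
The query/key-value cross-attention pattern the abstract describes can be sketched as follows (a toy PyTorch module; the learned queries, the MLP for the ground-plane positional embedding, and the output head are illustrative stand-ins, not LATR's actual lane-aware query generator):

```python
import torch
import torch.nn as nn


class LaneCrossAttention(nn.Module):
    """Toy cross-attention: lane queries attend to front-view image features
    augmented with a 3D ground-plane positional embedding (shapes invented)."""

    def __init__(self, dim: int = 256, num_queries: int = 40, heads: int = 8):
        super().__init__()
        # Learned queries stand in for LATR's lane-aware query generator.
        self.queries = nn.Embedding(num_queries, dim)
        # Embeds the 3D position of each feature on the estimated ground plane.
        self.pos_mlp = nn.Sequential(nn.Linear(3, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.head = nn.Linear(dim, 3)  # per-query 3D lane point (illustrative)

    def forward(self, img_feats, ground_xyz):
        # img_feats: (B, HW, dim) flattened front-view features
        # ground_xyz: (B, HW, 3) 3D location of each feature on the ground plane
        kv = img_feats + self.pos_mlp(ground_xyz)  # inject 3D positional embedding
        q = self.queries.weight.unsqueeze(0).expand(img_feats.size(0), -1, -1)
        out, _ = self.attn(q, kv, kv)              # queries attend over key-value pairs
        return self.head(out)                      # (B, num_queries, 3)
```

In the paper the ground plane is refined iteratively, so the positional embedding would be recomputed from the updated plane at each decoder layer rather than fixed as in this sketch.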
M^2-3DLaneNet: Multi-Modal 3D Lane Detection
Estimating accurate lane lines in 3D space remains challenging due to their
sparse and slim nature. In this work, we propose the M^2-3DLaneNet, a
Multi-Modal framework for effective 3D lane detection. Aiming at integrating
complementary information from multi-sensors, M^2-3DLaneNet first extracts
multi-modal features with modal-specific backbones, then fuses them in a
unified Bird's-Eye View (BEV) space. Specifically, our method consists of two
core components. 1) To achieve accurate 2D-3D mapping, we propose the top-down
BEV generation. Within it, a Line-Restricted Deform-Attention (LRDA) module is
utilized to effectively enhance image features in a top-down manner, fully capturing the slender structure of lanes. After that, it casts the 2D
pyramidal features into 3D space using depth-aware lifting and generates BEV
features through pillarization. 2) We further propose the bottom-up BEV fusion,
which aggregates multi-modal features through multi-scale cascaded attention,
integrating complementary information from camera and LiDAR sensors. Extensive experiments demonstrate the effectiveness of M^2-3DLaneNet, which outperforms previous state-of-the-art methods by a large margin, i.e., a 12.1% F1-score improvement on the OpenLane dataset.
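
As a rough illustration of the depth-aware lifting and pillarization step, the following lift-splat-style sketch (PyTorch; the grid extents, cell size, and the single-depth-per-pixel simplification are assumptions, not the paper's LRDA module) back-projects image features with a predicted depth map and scatter-adds them into a BEV grid:

```python
import torch


def lift_and_pillarize(feats, depth, intrinsics, bev_hw=(200, 200), cell=0.5):
    """Depth-aware lifting of image features into a BEV grid (toy version).
    feats: (C, H, W) image features, depth: (H, W) predicted metric depth,
    intrinsics: (3, 3) camera matrix."""
    C, H, W = feats.shape
    v, u = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    # Back-project every pixel: [X, Y, Z] = K^-1 [u*d, v*d, d]
    pix = torch.stack([u.float() * depth, v.float() * depth, depth], dim=-1)
    cam = pix.reshape(-1, 3) @ torch.inverse(intrinsics).T  # (H*W, 3) camera frame

    # Pillarization: drop height, quantize x (lateral) and z (forward) into
    # BEV cells, then scatter-add the per-pixel features into the grid.
    bev = torch.zeros(C, bev_hw[0] * bev_hw[1])
    gx = (cam[:, 0] / cell + bev_hw[1] // 2).long().clamp(0, bev_hw[1] - 1)
    gz = (cam[:, 2] / cell).long().clamp(0, bev_hw[0] - 1)
    bev.index_add_(1, gz * bev_hw[1] + gx, feats.reshape(C, -1))
    return bev.reshape(C, *bev_hw)
```

Once camera features live in this BEV grid, fusing them with pillarized LiDAR features reduces to operations on two aligned 2D feature maps, which is what the bottom-up BEV fusion with cascaded attention operates on.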
Let Images Give You More: Point Cloud Cross-Modal Training for Shape Analysis
Although recent point cloud analysis achieves impressive progress, the
paradigm of representation learning from a single modality gradually meets its
bottleneck. In this work, we take a step towards more discriminative 3D point cloud representations by fully taking advantage of images, which inherently
contain richer appearance information, e.g., texture, color, and shade.
Specifically, this paper introduces a simple but effective point cloud
cross-modality training (PointCMT) strategy, which utilizes view-images, i.e.,
rendered or projected 2D images of the 3D object, to boost point cloud
analysis. In practice, to effectively acquire auxiliary knowledge from view
images, we develop a teacher-student framework and formulate the cross-modal
learning as a knowledge distillation problem. PointCMT eliminates the
distribution discrepancy between different modalities through novel feature and
classifier enhancement criteria and effectively avoids potential negative transfer. Note that PointCMT improves the point-only representation without any architecture modification. Extensive experiments verify significant gains on various datasets with strong backbones: equipped with PointCMT, PointNet++ and PointMLP achieve state-of-the-art performance on two benchmarks, with 94.4% and 86.7% accuracy on ModelNet40 and ScanObjectNN, respectively. Code will be made available at https://github.com/ZhanHeshen/PointCMT. (To appear in NeurIPS 2022.)
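
Since PointCMT formulates cross-modal learning as knowledge distillation, a standard teacher-student objective gives the flavor of the training signal. The sketch below is a generic KD loss (the temperature and weighting are illustrative), not the paper's specific feature and classifier enhancement criteria:

```python
import torch.nn.functional as F


def distillation_loss(student_logits, teacher_logits, labels, T: float = 4.0, alpha: float = 0.5):
    """Generic KD objective illustrating image-teacher -> point-student transfer.
    student_logits / teacher_logits: (B, num_classes); labels: (B,)."""
    # Supervised loss on the point-cloud student.
    ce = F.cross_entropy(student_logits, labels)
    # Soft-target loss: match the (frozen) image teacher's class distribution.
    kd = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)
    return alpha * ce + (1.0 - alpha) * kd
```

Because only the student's weights are updated, the image branch is needed solely at training time, which matches the claim that PointCMT improves the point-only representation without changing the backbone.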
X4D-SceneFormer: Enhanced scene understanding on 4D point cloud videos through cross-modal knowledge transfer
The field of 4D point cloud understanding is rapidly developing with the goal of analyzing dynamic 3D point cloud sequences.
However, it remains a challenging task due to the sparsity and lack of texture in point clouds. Moreover, the irregularity of point clouds makes it difficult to align temporal information within video sequences. To address these issues, we propose a novel cross-modal knowledge transfer framework, called X4D-SceneFormer. This framework enhances 4D scene understanding by transferring texture priors from RGB sequences using a Transformer architecture with temporal relationship mining. Specifically, the framework is designed with a dual-branch architecture, consisting of a 4D point cloud transformer and a Gradient-aware Image Transformer (GIT). The GIT combines visual texture and temporal correlation features to offer rich semantics and dynamics for better point cloud representation. During training, we employ multiple knowledge transfer techniques, including temporal consistency losses and masked self-attention, to strengthen the knowledge transfer between modalities. This leads to enhanced performance during inference using single-modal 4D point cloud inputs. Extensive experiments demonstrate the superior performance of our framework on various 4D point cloud video understanding tasks, including action recognition, action segmentation and semantic segmentation.
The results achieve first place on the HOI4D challenge, i.e., 85.3% (+7.9%) accuracy for 4D action segmentation and 47.3% (+5.0%) mIoU for semantic segmentation, outperforming the previous state-of-the-art by a large margin. We release the code at https://github.com/jinglinglingling/X4D
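
A minimal sketch of the kind of cross-modal, temporally aware objective described above (invented names and a plain MSE formulation, not the paper's exact losses): the point cloud branch is pushed to match the image branch both per frame and in its frame-to-frame dynamics, so the image branch can be dropped at inference:

```python
import torch.nn.functional as F


def cross_modal_consistency(point_feats, image_feats):
    """Align per-frame point features with image-branch features (toy objective).
    point_feats, image_feats: (B, T, D) sequence features from the two branches."""
    # Feature mimicry across modalities at every time step (teacher is detached).
    mimic = F.mse_loss(point_feats, image_feats.detach())
    # Temporal consistency: frame-to-frame feature changes should also agree.
    dp = point_feats[:, 1:] - point_feats[:, :-1]
    di = image_feats[:, 1:] - image_feats[:, :-1]
    temporal = F.mse_loss(dp, di.detach())
    return mimic + temporal
```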